Towards Efficient Positional Inverted Index
Authors
Abstract
We address the problem of positional indexing in the natural language domain. A positional inverted index stores the positions of each word, so it can reconstruct the original text file and the original file need not be stored separately. Our Positional Inverted Self-Index (PISI) stores the gaps between word positions, encoded with a variable byte code. The inverted lists of the individual terms are combined into a single inverted list that forms the backbone of the text file, since it stores the sequence of indexed words of the original file. This inverted list is synchronized with a presentation layer that stores separators, stop words, and variants of the indexed words. The presentation layer is encoded with Huffman coding. The space complexity of the PISI inverted list is $O\left((N - n)\lceil \log_{2^b} N \rceil + \left(\left\lfloor \frac{N - n}{\alpha} \right\rfloor + n\right) \times \left(\lceil \log_{2^b} n \rceil + 1\right)\right)$, where $N$ is the number of stems, $n$ is the number of unique stems, $\alpha$ is the step (period) of the back pointers in the inverted list, and $b$ is the size of a computer memory word in bits. The space complexity of the presentation layer is $O\left(-\sum_{i=1}^{N} \lceil \log_2 p_i^{n(i)} \rceil - \sum_{j=1}^{N'} \lceil \log_2 p_j \rceil + N\right)$, where $p_i^{n(i)}$ is the probability of the stem variant at position $i$, $p_j$ is the probability of the separator or stop word at position $j$, and $N'$ is the number of separators and stop words.
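The gap encoding mentioned in the abstract can be illustrated with a minimal sketch. It is not the PISI implementation: it covers only generic gap encoding of one sorted position list with a variable byte code, and it assumes one common byte layout (7 data bits per byte, with the high bit marking the last byte of each number), which the abstract does not specify. The function names `vbyte_encode`, `encode_positions`, and `decode_positions` are hypothetical.

```python
from itertools import accumulate
from typing import Iterable, List


def vbyte_encode(n: int) -> bytes:
    """Encode one non-negative integer in 7-bit groups.

    Assumed layout: most significant group first; the high bit (0x80)
    is set only on the final byte of the number.
    """
    if n < 0:
        raise ValueError("only non-negative integers can be encoded")
    groups = [(n & 0x7F) | 0x80]  # lowest 7 bits, flagged as the last byte
    n >>= 7
    while n:
        groups.append(n & 0x7F)   # higher groups carry no flag
        n >>= 7
    return bytes(reversed(groups))


def encode_positions(positions: Iterable[int]) -> bytes:
    """Convert sorted word positions to gaps and concatenate their codes."""
    out = bytearray()
    prev = 0
    for pos in positions:
        if pos < prev:
            raise ValueError("positions must be sorted in ascending order")
        out += vbyte_encode(pos - prev)  # store the gap, not the position
        prev = pos
    return bytes(out)


def decode_positions(data: bytes) -> List[int]:
    """Decode the byte stream back to absolute positions."""
    gaps = []
    value = 0
    for byte in data:
        value = (value << 7) | (byte & 0x7F)
        if byte & 0x80:           # flagged byte closes the current number
            gaps.append(value)
            value = 0
    return list(accumulate(gaps))  # running sum turns gaps into positions


positions = [3, 10, 150, 151]
encoded = encode_positions(positions)
decoded = decode_positions(encoded)
```

Here the positions 3, 10, 150, 151 become the gaps 3, 7, 140, 1. Only the gap 140 needs two bytes, so the list occupies five bytes in total, and decoding restores the original positions.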
Similar resources
Positional Data Organization and Compression in Web Inverted Indexes
To sustain the tremendous workloads they suffer on a daily basis, Web search engines employ highly compressed data structures known as inverted indexes. Previous works demonstrated that organizing the inverted lists of the index in individual blocks of postings leads to significant efficiency improvements. Moreover, the recent literature has shown that the current state-of-the-art compression s...
Phrase Queries with Inverted + Direct Indexes
Phrase queries play an important role in web search and other applications. Traditionally, phrase queries have been processed using a positional inverted index, potentially augmented by selected multiword sequences (e.g., n-grams or frequent noun phrases). In this work, instead of augmenting the inverted index, we take a radically different approach and leverage the direct index, which provides...
Intra-Positional and Inter-Positional Differences in Somatotype Components and Proportions of Particular Somatotype Categories in Youth Volleyball Players
Objective(s). Main aim of this cross-sectional study was to analyse intra-positional, inter-positional differences in proportions of particular somatotype categories in youth volleyball players. Methods. Heath-Carter method was used to determine somatotype characteristics of 181 young female volleyball players (age 14.05±0.93, height 170.03±7.61 cm, mass 57.80±8.59 kg, bod...
Towards Efficient SPARQL Query Processing on RDF Data
Efficient support for querying large-scale RDF triples plays an important role in Semantic Web data management. This paper proposes an efficient RDF query engine to evaluate SPARQL queries, where the inverted index structure is employed for indexing RDF triples. We first design and implement a set of operators on the inverted index for query optimization and evaluation. Then we propose a main-t...
Window Extraction for Information Retrieval
Proximity-based term dependencies have been proposed and used in a variety of effective retrieval models. The execution of these dependency models is commonly supported through the use of positional inverted indexes. However, few of these models detail how instances of proximate terms should be extracted from the lists of positional data. In this study, we investigate three algorithms for the e...
Journal title:
- Algorithms
Volume 10, Issue
Pages -
Publication date 2017